Extract single elements

Internet Macros can extract data from Web sites. Click on the EXTRACT button while in recording mode to bring up the extraction wizard.

Note: The EXTRACT command requires Internet Explorer 6.0 or later.

To define an EXTRACT tag, proceed as follows:

(1) Open the Extraction Wizard (EXTRACT button on the control panel).

Note: If the information you want to extract is inside a framed web site, you need to click inside the frame that contains the information you want to extract before opening the Extraction Wizard. This generates the FRAME command and marks the frame as active for the extraction.

(2) In the browser window or frame, select the text that you want to extract.

(3) Click the "Suggest" Button. IIM creates a suggestion for the extraction anchor of the extraction tag. IIM later uses this anchor to return to the position where you want to extract text.

(4) Click TEST to test run the extraction tag.

(5) If you are satisfied with the result, click "Add" to add the EXTRACT statement to the macro.

If you extract a complete table, the table data is automatically converted into comma-separated data (see the "demo-extract-table" macro). This is a very powerful feature. It allows you to get the data of a complete table with only one command!

There are two methods to retrieve extracted data:

1. You can save extracted data directly to a file by adding "SAVEAS TYPE=EXTRACT" manually to the macro. All items that were extracted before the SAVEAS command are saved to the file in one row, like "item1 , item2 , item3 , ...". With the next start of the macro or the next round of a LOOP, a new line is added to the file. The default file name is "extract.csv". The file name can be changed with "SET !FILEEXTRACT newname.txt".
2. If you do not use a "SAVEAS TYPE=EXTRACT" command in the macro, IIM returns the data to your code via the Scripting Interface.
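
As an illustration of the first method, the following Python sketch reads a file written by SAVEAS TYPE=EXTRACT back into a list of rows. It assumes the file is plain comma-separated text with one row per macro run, as described above. The function name read_extract_rows and the path passed to it are hypothetical, and any quoting rules IIM may apply to the items are not covered here.

```python
import csv


def read_extract_rows(path):
    """Return one list of extracted items per line of the extract file."""
    rows = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                # Skip blank lines between runs, if any.
                continue
            # Remove padding around the separators ("item1 , item2").
            rows.append([item.strip() for item in row])
    return rows
```

Calling read_extract_rows("extract.csv") would then give one inner list per macro run or LOOP round, provided the file sits in the current working directory (an assumption; the actual location depends on your installation).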

Tip: Sometimes IIM cannot suggest a proper EXTRACT tag automatically. In this case you can create one manually and test it with the TEST button in the Extraction Wizard.

Note: If one or more EXTRACT commands are present in a macro, the return code is "2" if the macro completed successfully, even if one or more of the EXTRACT commands failed because the extraction anchor was not found. Typically this happens because the web page changed.

The reason for this behavior is that a macro can contain many EXTRACT commands, and often only one of them fails to find its extraction anchor. In this case the string "#EANF#" is returned. "EANF" stands for "Extraction anchor not found". To check whether a particular EXTRACT command was successful, you only need to check whether "#EANF#" is present in the returned string. This can be very useful, for example if you use EXTRACT to check whether a keyword is present on a page: a returned string containing #EANF# indicates that the keyword was not found.
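
The check described above can be sketched in Python as follows. This is a minimal sketch: only the "#EANF#" marker comes from this documentation, the function names are hypothetical, and how the returned string reaches your code through the Scripting Interface is not shown.

```python
# Marker returned when the extraction anchor was not found.
EANF_MARKER = "#EANF#"


def extraction_failed(extracted):
    """Return True if the returned string contains the #EANF# marker."""
    return EANF_MARKER in extracted


def keyword_present(extracted):
    """Treat a successful extraction as "keyword found on the page"."""
    return not extraction_failed(extracted)
```

Because the macro return code does not reveal a failed EXTRACT, a check of this kind is the only way for calling code to tell the two cases apart.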

The text of a popup can be extracted with EXTRACTPOPUP. Add this command to your macro after the TAG command that triggers the popup.

Note 1: You can also extract values from INPUT boxes and SELECT elements (drop-down lists). For SELECT boxes, the currently active value is extracted. To extract ALL values of a drop-down list, manually add #ALL# before the attribute.

Example:

Select currently active values:
EXTRACT POS=1 ELEM=0 ATTR=<SELECT<SP>size=1<SP>name=main>*

Select all values in a list:
EXTRACT POS=1 ELEM=0 ATTR=#ALL#<SELECT<SP>size=1<SP>name=main>*

Note 2: Some web pages use a "<PRE ...>" tag in their HTML code. It marks the text as "preformatted": all spaces and carriage returns are rendered exactly as typed. Internet Macros extracts the information enclosed in a <PRE> tag correctly, including the formatting. If you transfer the extracted data via the Scripting Interface, all formatting is retained unchanged. The formatting changes in only two cases: line breaks are removed for display in the test dialog box, and they are also removed if you use the SAVEAS TYPE=EXTRACT method. The latter is necessary to keep the CSV file well formed, because in the CSV format a line break starts a new row.



Example 1:

HTML page: 

<HTML>
<B>Hello World</B>
<B>Text to be extracted</B>
<B>Good Morning</B>
<B>Good Afternoon</B>
<B>(c) iOpus</B>

</HTML>

To extract the text "Text to be extracted", use:

EXTRACT POS=2 ELEM=0 ATTR=<B>*

The POS=2 statement indicates that Internet Macros must search for the second occurrence of the extraction anchor on the web page. The ATTR tag contains the HTML extraction anchor, in this case a simple <B> tag plus the wildcard symbol "*". IIM searches for this tag and extracts the content found there. (ELEM is always 0; this value is reserved for later use.)

Important: You MUST use "*" at the end of the extraction anchor to tell Internet Macros that it should ignore the rest of the element when searching for the anchor. Other possible extraction anchors in this case could be "<B>Te*" or "<B>Text<SP>to*". However, they are not recommended because they will fail if "Text to be extracted" is changed to "Data to be extracted" on the web page.
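
The effect of a shorter or longer anchor can be illustrated with a small Python sketch. This is only an analogy: it uses shell-style wildcard matching from Python's fnmatch module in place of IIM's own matching, which may differ (for example in how it handles upper and lower case). The function name anchor_matches is hypothetical.

```python
from fnmatch import fnmatchcase


def anchor_matches(anchor, element):
    """Approximate anchor matching with shell-style '*' wildcards."""
    # <SP> is the macro notation for a space character.
    pattern = anchor.replace("<SP>", " ")
    return fnmatchcase(element, pattern)


old_page = "<B>Text to be extracted</B>"
new_page = "<B>Data to be extracted</B>"

# The short anchor survives the change of wording on the page.
print(anchor_matches("<B>*", old_page), anchor_matches("<B>*", new_page))
# prints: True True

# The longer anchor stops matching once the text changes.
print(anchor_matches("<B>Text<SP>to*", old_page),
      anchor_matches("<B>Text<SP>to*", new_page))
# prints: True False
```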

Example 2:

HTML: <li><nobr><font face="Verdana" size="-1"><b>Salary:</b>33,000.00 per year</font></nobr></li>

EXTRACT POS=1 ELEM=0 ATTR=<FONT<SP>face=Verdana<SP>size=-1><B>Salary:</B>*


In this example we have a longer extraction anchor. As you can see, you basically cut the HTML in half to get the extraction tag and add "*" at the end. You can add "*" at multiple locations if required:

EXTRACT POS=1 ELEM=0 ATTR=<FONT<SP>face=*<SP>size=*><B>Salary:</B>*


Since the extraction anchor occurs only once on the Web page, POS has the default value of 1. The elements <li> and <nobr> are not used for comparison. The label "Salary:" is fixed and can be used as part of the extraction anchor. The extracted text is "Salary: 33,000.00 per year".